-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Prune add file entry using columns stats in Delta checkpoint iterator #25311
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
ec3ab17
to
eda4dc9
Compare
0882a21
to
0cb9384
Compare
7b5cc12
to
cf01677
Compare
Is it flaky ? |
289beae
to
cf01677
Compare
|
fbdf5d3
to
9385c8d
Compare
The stats that were recorded in the checkpoint file(stats) has sth like '1960-01-01 01:02:03' which is the varchar type, but after create or replace the type changed to the |
66879be
to
76333ea
Compare
76333ea
to
7d1bdf0
Compare
Pls address the code conflicts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we see a test in TestDeltaLakeFileOperations
to showcase on the file access level the effectiveness of this contribution ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be hard to observe a noticeable difference in file access, because we already prune AddFile entries using column statistics during split generation, specifically in DeltaLakeSplitManager#getSplits
and FileBasedTableStatisticsProvider#getTableStatistics
.
The commit is apply that pruning earlier, by filtering out AddFile entries that don’t satisfy the predicates while reading from the checkpoint, what we can see is that the size of add entries comes from is pruned, which shows in the added test
7d1bdf0
to
4c0ee56
Compare
...st/java/io/trino/plugin/deltalake/transactionlog/checkpoint/TestCheckpointEntryIterator.java
Outdated
Show resolved
Hide resolved
Pls add a high-level description to add potential reviewers now or later in a few months to understand why was this contribution added. |
9f3b998
to
bff16a8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rebase to latest master
...c/main/java/io/trino/plugin/deltalake/transactionlog/checkpoint/CheckpointEntryIterator.java
Outdated
Show resolved
Hide resolved
e0bf36f
to
5dd7a3f
Compare
...n/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/TableSnapshot.java
Outdated
Show resolved
Hide resolved
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/util/DeltaLakeDomains.java
Outdated
Show resolved
Hide resolved
...-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/TransactionLogAccess.java
Outdated
Show resolved
Hide resolved
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/util/DeltaLakeDomains.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This pull request optimizes the Delta checkpoint iterator by pruning AddFile entries based on column statistics earlier in the processing pipeline. Key changes include:
- Introducing early filtering using a union of partition and non-partition constraints.
- Adding tests to verify the behavior for various statistics and constraint scenarios.
- Adjusting API visibility and renaming variables to better represent the union of constraints.
Reviewed Changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated no comments.
Show a summary per file
File | Description |
---|---|
spark_default_format_timestamp_timezone_statistics.json | Added test JSON file with timestamp statistics. |
TestDeltaLakeFileStatistics.java | Added tests for parsing Spark default timestamp formats. |
TestCheckpointEntryIterator.java | Introduced tests to validate statistic-based entry pruning. |
DeltaLakeDomains.java | Updated to use a local variable for clarity when matching partition keys. |
DeltaLakeJsonFileStatistics.java | Modified timestamp parsing logic to include a fallback using exception handling. |
CheckpointEntryIterator.java | Replaced partitionConstraint with a union-based domains variable. |
TransactionLogParser.java | Made readPartitionTimestampWithZone public to support wider usage. |
TransactionLogAccess.java, FileBasedTableStatisticsProvider.java, DeltaLakeSplitManager.java, DeltaLakeMetadata.java | Updated to leverage the union of constraints to filter entries earlier. |
AddFileEntry.java | Refactored statistics extraction into a static helper method. |
Comments suppressed due to low confidence (2)
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/DeltaLakeJsonFileStatistics.java:181
- The use of exception-driven fallback in readStatisticsTimestampWithZone may impact performance if exceptions occur frequently. Consider refactoring the logic to avoid relying on exceptions for normal control flow.
ZonedDateTime zonedDateTime;
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/TransactionLogParser.java:186
- The access modifier of readPartitionTimestampWithZone has been changed from package-private to public; please confirm that this change is intentional and that the broader API exposure is acceptable.
static Long readPartitionTimestampWithZone(String timestamp)
9f644ba
to
41422a9
Compare
...in/java/io/trino/plugin/deltalake/transactionlog/statistics/DeltaLakeJsonFileStatistics.java
Outdated
Show resolved
Hide resolved
41422a9
to
63cd8f9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR extends the optimization for pruning AddFile entries in Delta checkpoint files by applying column statistics filtering earlier in the processing pipeline. Key changes include:
- Enhancing test coverage with new test cases for statistic-based pruning.
- Refactoring method signatures to introduce the non-partition constraint and merging it with the partition constraint.
- Updating logic in various modules (checkpoint iterator, split manager, metadata, etc.) to incorporate statistics-based filtering.
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.
Show a summary per file
File | Description |
---|---|
plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/transactionlog/checkpoint/TestCheckpointEntryIterator.java | Added new tests for early pruning via column statistics. |
plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestDeltaLakeSplitManager.java | Updated method signatures to handle non-partition constraints. |
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/util/DeltaLakeDomains.java | Adjusted partition value retrieval with an explicit null check. |
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/checkpoint/CheckpointEntryIterator.java | Refactored constraint handling and inlined statistics predicate evaluation. |
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/TransactionLogAccess.java | Updated active file filtering to use intersected constraints. |
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/TableSnapshot.java | Passed non-partition constraint to checkpoint log entry generation. |
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/AddFileEntry.java | Modified stats parsing to return the computed statistics predicate. |
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/statistics/FileBasedTableStatisticsProvider.java | Adjusted statistics predicate creation call with new parameter. |
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeSplitManager.java | Updated split generation to utilize the new constraint merging logic. |
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeMetadata.java | Updated table creation and validation calls to include non-partition constraints. |
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/util/DeltaLakeDomains.java
Outdated
Show resolved
Hide resolved
63cd8f9
to
9545d64
Compare
Started benchmark workflow for this PR with test type =
|
Started benchmark workflow for this PR with test type =
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changes lgtm
Can you do some benchmarking with large checkpoints to see whether this additional pruning has significant benefits ?
At best this is saving the parsing of a few fields for AddFilesEntry, so it will be good to evaluate the benefit before we add to the complexity of this code.
This pull request has gone a while without any activity. Ask for help on #core-dev on Trino slack. |
Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time. |
Description
We already prune AddFile entries based on column statistics when generating splits. This PR extends that optimization by applying pruning earlier — while reading active entries from the checkpoint file.
By filtering out irrelevant entries at this stage, we can reduce memory usage and potentially improve split generation performance, especially for large checkpoint files
Additional context and related issues
Release notes
(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: